CAGEF_services_slide.png

Advanced Graphics and Data Visualization in R

Lecture 01: "R"-efresher on R and best practices testing


0.1.0 An overview of Advanced Graphics and Data Visualization in R

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 datasets, the techniques learned herein will be broadly applicable.

This lesson is the first in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will be your crash-course on Jupyter notebooks and R to refresh on packages and principles that will be relevant throughout our course. In our lectures and your assignments we will be working with some uncurated data to simulate the full experience of working with data from start to finish. It's important that we are all familiar with, and understand the majority of the tidy data methods that we'll be using in class so that we can focus on the new material as it appears. We'll use some standard packages and practices to finesse our data before visualizing it, so let's R-efresh ourselves.

At the end of this lecture we will have covered the following topics:

  1. Working with Jupyter notebooks and best coding practices.
  2. R data types, objects and working with them.
  3. Long-format and tidy data principles using the tidyverse package.
  4. Basic control flow and plotting.

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's datasets will focus on epidemiological data from the Ontario provincial government found here.

0.4.1 Dataset 1: Ontario_daily_change_in_cases_by_phu.csv

This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 cases throughout different public health units in the province. It is in a comma-separated format and has been collected since 2020-03-24.

0.4.2 Dataset 2: Ontario_covidtesting.csv

This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 throughout the province. It is in a comma separated format and has been growing/expanding since initial tracking started on 2020-01-26.


0.5.0 Packages used in this lesson

repr- a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

lubridate and zoo are helper packages used for working with date formats in R

Let's run our first code cell!


1.0.0 Coding in Jupyter Notebooks

All of your work with Jupyter Notebooks on the University of Toronto JupyterHub will be contained within a browser tab, with the address bar showing something similar to

https://jupyter.utoronto.ca/user/assigned-username-hexadecimal/tree/2022.03_Adv_Graphics_R

All of this is running remotely on a University of Toronto server rather than your own machine.

You'll see a directory structure from your home folder:

i.e. \2022.03-Adv_Graphics_R\ with a Lecture_01_R_Introduction folder within. Clicking on that, you'll find Lecture_01.R-efresher.skeleton.ipynb, which is the notebook we will use for today's code-along lecture.


1.1.0 Why is this class using Jupyter Notebooks?

We've implemented the class this way to reduce the burden of having to install various programs. While installation can be a little tricky, it's really not that bad. For this introduction course, however, you don't need to go through all of that just to improve on your data visualization skills.

Jupyter Notebooks also give us the option of inserting "markdown" text much like what you're reading at this very moment, so we can intersperse ideas and information between our code blocks.

There is, however, an appendix section at the end of this lecture detailing how to install Jupyter Notebooks (and the R-kernel for them) as well as independent installation of the R-kernel itself and a great integrated development environment (IDE) called RStudio.


1.2.0 Packages contain useful functions that we'll use often

So... what is in these packages? A package is typically a collection of functions, datasets, and documentation bundled together.

Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).
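To make this concrete, here is a minimal sketch of a user-defined function (the name and values are illustrative, not from the course material):

```r
# A function takes an input, evaluates an expression, and returns an output
celsius_to_fahrenheit <- function(temp_c) {
  temp_c * 9 / 5 + 32   # the expression that is evaluated
}

celsius_to_fahrenheit(100)   # returns 212
```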

In this course we will frequently rely on a package called tidyverse which is also composed of a series of other packages we can use to reformat our data like readr, dplyr, tidyr and stringr.


1.3.0 Jupyter notebooks run programming language kernels like R

Behind the scenes of each Jupyter notebook a programming kernel is running. For instance, depending on setup our notebooks can run a true or "emulated" R-kernel to interpret each code cell as if it were written specifically for the R language.

As we move from code cell to new code cell, all of the variables or objects we have created are stored within memory. We can refer to these as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cells!

There are some options in the "Cell" menu that can alleviate these problems such as "Run All Above". If you think you've made a big error by overwriting a key object, you can use that option to "re-initialize" all of your previous code!

The run order of your code is also visible at the side of each code cell as [x]. When a code cell is still actively running it will be denoted as [*] since a number cannot be assigned to it. You'll also notice your kernel (top right of the menu bar) has a small circle that will be dark while running, and clear while idle.

Remember these friendly keys/shortcuts:

In Command mode


1.3.1 Why would you want to use a Jupyter Notebook?

Depending on your needs, you may find yourself doing the following:

Jupyter allows you to alternate between "markdown" notes and "code" that can be run or re-run on the fly.

Each data run and its results can be saved individually as a new notebook or as new cells to compare data and small changes to analyses!


1.4.0 Following best practices for coding will make life easier

Let's discuss some important behaviours before we begin coding:

1.4.1 Annotate your code with the # symbol

Why bother?

Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?

You can annotate your code for selfish reasons, or altruistic reasons, but please take the time to annotate your code.

How do I start?

Comments may/should appear in three places:


# Example commenting section
# At the beginning of the script, describing the purpose of your script and what you are trying to solve

bedmasAnswer <- 5 + 4 * 6 - 0 #In line: Describing a part of your code that is not obvious what it is for. 

#---------- Section dividers helps organize code structure ----------#
## Feel free to add extra hash tags to visually separate or emphasize comments

Maintaining well-documented code is also good for mental health!


1.4.2 Naming conventions for files, objects, and functions in R

Stylistically, you have the following options:

The most important aspects of naming conventions are being concise and consistent! Throughout this course you'll see a hybrid system that uses the underscore to separate words and a period right before denoting the object type, i.e. this_data.object.


1.4.3 Best Practices for Writing Scripts


1.5.0 Trouble-shooting basics

We all run into problems. We'll see a lot of mistakes happen in class too! That's OK if we can learn from our errors and quickly (or eventually) recover.

1.5.1 Determine the location and type of error

Usually when R generates an error it will produce some information about what has happened. This usually includes an error message detailing the kind of error it encountered or an error message generated by the function. It can also include a line where the error was encountered, or the name of the last function that was called before the error was encountered.

1.5.2 Common errors

quote-the-answers-are-all-out-there-we-just-need-to-ask-the-right-questions-oscar-wilde-123-30-66.jpg


1.5.3 Finding answers online

1.5.3.1 Asking a question in an online forum

Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.

Last but not least, to make life easier: Under the Help pane, there is a cheatsheet of Jupyter notebook keyboard shortcuts or a browser list here.


backToTheFoundationsOfR-logo-png-transparent.png
Before we learn to run, let's review what it means to walk with the foundations of R.

2.0.0 Foundations of R

There are many tips and tricks to remember about R but here we'll quickly recall some foundational knowledge that could be relevant in later lectures.

2.1.0 Assigning variables

If we want to hold on to a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!

-> and ->> Rightward assignment: we won't really be using this in our course.

<- and <<- Leftward assignment: assignment used by most 'authentic' R programmers but really just a historical keyboard throwback.

= Leftward assignment: commonly used token for assignment in many other programming languages but holds dual meaning!
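A quick sketch of the three styles side by side (the variable names are arbitrary):

```r
x <- 5    # leftward assignment: the conventional R style
x = 5     # also assigns, but `=` doubles as the named-argument token
5 -> x    # rightward assignment: rarely seen in practice

# Inside a function call, `=` binds an argument rather than assigning:
mean(x = c(1, 2, 3))   # here `x` names mean()'s argument; our x above is untouched
```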

Notes


2.2.0 Data types are the basic building blocks of R

Data types are used to classify the basic spectrum of values that are used in R. Here's a table describing some of the common data types we'll encounter.

| Data type | Description | Example |
|---|---|---|
| character | Can be single or multiple characters (strings) of letters and symbols. Assigned using single `'` or double `"` quotes | a#c&E |
| integer | Whole number values, either positive or negative | 1 |
| double | Any number that is not an integer | 7.5 |
| logical | Also known as a boolean, representing the state of a conditional (question) | TRUE or FALSE |
| NA | Represents the value of "Not Available", usually seen when imported data has missing values | NA |
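You can query the data type of any value with typeof(); a small sketch using the examples from the table:

```r
typeof("a#c&E")   # "character"
typeof(1L)        # "integer" (the L suffix marks an integer literal)
typeof(7.5)       # "double"
typeof(TRUE)      # "logical"
is.na(NA)         # TRUE -- NA needs its own test, not ==
```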

2.2.1 Data structures hold single or multiple values

The job of data structures is to "host" the different data types. There are five basic types of data structures that we'll use in R:

| Data structure | Dimensions | Restrictions |
|---|---|---|
| vector | 1D | Holds a single data type |
| matrix | 2D | Holds a single data type |
| array | nD | Holds a single data type |
| data frame | 2D | Holds multiple data types with some restrictions |
| list | 1D (technically) | Holds multiple data types AND structures |
data_structures.jpg
Sometimes it is helpful to imagine Data Structures as real-world objects to understand how they are shaped and related to each other.

2.2.2 Vectors are like a queue of a single data type


2.2.2.1 Coercion changes data from one type to another (where applicable)

R will implicitly force (coerce) a mixed vector to be of one data type: whichever type present is the most inclusive (here, character). When we explicitly coerce a change from one data type to the next, it is known as casting. You can cast between certain data types and also object types.

Importantly, when coercing, the R kernel converts from more specific to general types usually in this order:


logical $\rightarrow$ integer $\rightarrow$ numeric $\rightarrow$ complex $\rightarrow$ character $\rightarrow$ list.
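A short sketch of both implicit coercion and explicit casting (the values are illustrative):

```r
v1 <- c(TRUE, 2L, 3.5)   # implicitly promoted to the most inclusive type: double
v2 <- c(1, "a")          # implicitly promoted to character: "1" "a"
typeof(v1)               # "double"
typeof(v2)               # "character"

# Explicit coercion (casting) with the as.*() family
as.numeric(c("1", "2.5"))   # 1.0 2.5
as.integer(TRUE)            # 1
as.character(7.5)           # "7.5"
```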

2.2.3 Data Frames hold tabular data

2.2.3.1 Object classes

Now that we have had the opportunity to create a few different vector objects, let's talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this, the same function can respond differently depending on the class of the object it receives.

Some R package developers have created their own object classes. For example, many of the functions in the tidyverse generate tibble objects. They behave in most ways like a data.frame but have a more refined print structure, making it easier to see information such as column types when viewing them quickly. In general, from a trouble-shooting standpoint, it is good to be aware that your data may need to be formatted to fit a certain class of object when using different packages.

After we are done tidying most of our datasets, they will be in tibble objects, but all of the basic data frame functions apply to these as well.


2.2.3.2 Data frames are groups of vectors aligned as columns

While matrices are 2-dimensional structures limited to a single specific type of data within each instance, data frames treat each column of the structure like a vector. The data frame, however, can have multiple data types mixed across each different column. Data frame rules to remember are:

  1. Within a column, all members must be of the same data type (i.e. character, numeric, factor, etc.)
  2. All columns must have the same number of rows (hence the matrix shape)

Data frames allow us to generate tables of mixed information much like an Excel spreadsheet.
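A small sketch of a data frame mixing column types (the puppy data is invented for illustration):

```r
puppies.df <- data.frame(
  name       = c("Rex", "Bella", "Max"),   # character column
  age        = c(2, 4, 3),                 # numeric column
  vaccinated = c(TRUE, TRUE, FALSE)        # logical column
)

nrow(puppies.df)        # 3 -- all columns share the same number of rows
puppies.df$age          # access a column as a vector: 2 4 3
puppies.df[2, "name"]   # access a single cell by row and column: "Bella"
```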


2.2.3.3 Some useful data frame commands (for now)

There are many more ways to access and manipulate data frames that we'll explore further down the road. Let's review some basic data frame code.


2.2.4 Lists are amorphous bundles strung together with code

Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types to pass around your scripts, and functions, or when receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!

If you forget the contents of your list, use the str() function to check out its structure. str() will tell you the number of items in your list and their data types.
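For instance, a list bundling mixed types and structures, inspected with str() (the contents are illustrative):

```r
results.list <- list(
  model_name   = "linear",
  coefficients = c(0.5, 1.2),
  data         = data.frame(x = 1:3, y = c(2, 4, 6))
)

str(results.list)      # reports 3 elements and the type/structure of each
length(results.list)   # 3
```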


2.2.4.1 Accessing elements from a list is accomplished in multiple ways

Accessing lists is much like opening up a box of boxes of chocolates. You never know what you're gonna get when you forget the structure!

You can access elements with a mixture of number and naming annotations much like data frames. Also [[x]] is meant to access the xth "element" of the list. Note that unnamed lists cannot be accessed with naming annotations.
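A sketch of the different access styles on a toy list:

```r
named.list <- list(a = 1:3, b = "hello", c = TRUE)

named.list[[2]]     # the 2nd element itself: "hello"
named.list[2]       # a one-element list, still named "b"
named.list$a        # access by name: 1 2 3
named.list[["c"]]   # also by name: TRUE

unnamed.list <- list(1:3, "hello")
unnamed.list[[1]]   # positional access only; there are no names to use
```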


Comprehension Question 2.2.4.1: Suppose we had a list named multiDF.list consisting of 3 data frames, as shown in the following code cell. How would you subset the 2nd and 3rd data frames into their own list? How would you access the "values" column from the 3rd data frame? Use the following code cell to help you out.

2.3.0 Factors codify your data into categorical variables

Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables and although it is tempting to think of them as character vectors this is a dangerous mistake. Adding or changing data in a data frame with pre-existing factors requires that you match factor levels correctly as well.

Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. At its core, a factor is really just an integer vector with an additional attribute, levels, which defines the accepted values for that variable (inspect it with the levels() function).

2.3.0.1 Why use factors?

Why not just use character vectors, you ask?

Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.

2.3.0.2 A historical note about R 4.x.x versus R 3.x.x

Since the inception of R, data.frame() calls have been used to create data frames but the default behaviour was to convert strings (and characters) to factors! This is a throwback to the purpose of R, which was to perform statistical analyses on datasets with methods like ANOVA which examine the relationships between variables (ie factors)!

As R has become more popular and its applications and packages have expanded, incoming users have been faced with remembering this obscure behaviour, leading to lost hours of debugging grief as they wonder why they can't pull information from their data frames to do a simple analysis on C. elegans strain abundance via molecular inversion probes in datasets of multiplexed populations. #SuspiciouslySpecific

That meant that users usually had to create data frames including the toggle

data.frame(name=character(), value=numeric(), stringsAsFactors = FALSE)

Fret no more! As of R 4.x.x the default behaviour has switched and stringsAsFactors = FALSE is the default! Now if we want our characters to be factors, we must convert them explicitly, or turn this behaviour on at the outset of creating each data frame!


2.3.1 Specify factors and their levels explicitly during or after data.frame creation

From above, you can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively you can coerce character vectors to factors after generating them.

R's default behaviour puts factor levels in alphabetical order. This can cause problems if we aren't aware of it. You can check the order of your factor levels with the levels() command. Furthermore you can specify, during factor creation, your level order.

Always check to make sure your factor levels are what you expect.

With factors, we can deal with our character levels directly, or their numeric equivalents.
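A sketch showing the default alphabetical ordering and an explicitly specified level order (the size labels are illustrative):

```r
sizes <- c("small", "large", "medium", "small")

f1 <- factor(sizes)
levels(f1)   # alphabetical by default: "large" "medium" "small"

# Specify the level order explicitly at creation
f2 <- factor(sizes, levels = c("small", "medium", "large"))
levels(f2)       # "small" "medium" "large"
as.integer(f2)   # the numeric equivalents: 1 3 2 1
```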


2.3.2 More facts about factors

  1. Use levels() to list the levels and their order for your factor

  2. To rename levels of a factor, declare and reassign your factor.

  3. Move a single level to the first position within your factor levels with relevel().

  4. Factor levels can be assigned an order of precedence during their creation with the parameter ordered = TRUE.

  5. Define labels for your factor during their creation with the parameter labels = c(). Note that level order is assigned before labels are added to your data. You are essentially labeling the integer assigned to your factor levels so be careful when using this parameter!

Advanced factor functions with forcats If you're looking for more advanced functions that you can use to manipulate, sort or update factors, check out the forcats package. With it, you can refactor based on functions, frequency, or explicitly re-specify the order of one or more factor levels.

matrix-addition.gif

2.4.0 Mathematical operations on data frames and arrays

Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!

2.4.1 Mathematical operations are applied differently depending on data type

Therefore be careful to specify your numeric data for mathematical operations.


2.5.0 Using the apply() family of functions to perform actions across data structures

The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, and not on your entire matrix or data frame.

2.5.1 The apply() function will recognize basic functions and use them on vectorized data

For example, we might have a count table where rows are genes, columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the apply() function. apply() takes an array, matrix (or something that can be coerced as such, like a numeric data frame), and applies a function over rows or columns. The apply() function takes the following parameters:

and returns a vector, array or list depending on the nature of X.

Let's practice by invoking the sum function.
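A minimal sketch of the gene-count example described above (the matrix is invented for illustration):

```r
# Rows are (hypothetical) genes, columns are samples
counts <- matrix(c(1, 2, 3,
                   4, 5, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

apply(counts, 1, sum)   # MARGIN = 1: sum across each row  -> geneA 6, geneB 15
apply(counts, 2, sum)   # MARGIN = 2: sum down each column -> s1 5, s2 7, s3 9
```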


2.5.2 The other members of the apply() family

There are 3 additional members of the apply() family that perform similar functions with varying outputs

  1. lapply(data, FUN, ...) is usable on data frames, lists, and vectors. It returns a list as output.
    • It will coerce non-list objects to a list
    • Additional arguments to FUN will be applied from the ...
  2. sapply(data, FUN, ...) works similarly to lapply() except it tries to simplify the output to the most elementary data structure possible, i.e. it will return the simplest form of the data that makes sense as a representation.
  3. mapply(FUN, data, ...) is short for "multivariate" apply and it applies a function across multiple lists or multiple vector arguments.

Notice how in using sapply() to extract from a list of data frames, a single matrix was returned - a single output in the simplest form that maintains structure.

Now let's give mapply() a try.
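A side-by-side sketch of the three variants on toy data:

```r
lapply(list(a = 1:3, b = 4:6), sum)   # returns a list: $a 6, $b 15
sapply(list(a = 1:3, b = 4:6), sum)   # simplifies to a named vector: a 6, b 15

# mapply applies a function element-wise across multiple vectors
mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```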


not_available.png

2.6.0 Special data: NA and NaN values

Missing values in R are handled as NA (Not Available). Impossible values (like the result of dividing by zero) are represented by NaN (Not a Number). These types of values can be considered null values. Both, especially NAs, must be handled in special ways, otherwise they may lead to errors in some functions.

For our purposes, we are not interested in keeping NA data within our datasets so we will usually detect and remove them or replace them within our data after it is imported.

2.6.1 Helpful functions and information for dealing with NA data

  1. is.na() returns a logical vector reporting which values from your query are NA.
  2. complete.cases() returns a logical vector flagging rows without any NA values.
  3. Some functions can ignore NA values with the na.rm = TRUE parameter: ie mean(), sum() etc.
  4. Additional functions in the tidyr package can also be used to work with NA values.
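The base R helpers above can be sketched on a toy vector and data frame:

```r
x <- c(1, NA, 3, NA, 5)

is.na(x)                # FALSE TRUE FALSE TRUE FALSE
mean(x)                 # NA -- the NA propagates unless told otherwise
mean(x, na.rm = TRUE)   # 3

demo.df <- data.frame(a = c(1, NA), b = c(2, 3))
complete.cases(demo.df)   # TRUE FALSE -- rows with no NA values
```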

tidydata_2.jpg https://cfss.uchicago.edu/notes/tidy-data/


3.0.0 Welcome to the tidyverse

Let's begin with some definitions:

latrines_wide_to_long.png

In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data. To read more about wide and long formats, visit here.

Why tidy data?

Data cleaning/wrangling (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that standardized data structure can help this process along.

In Tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Every cell is a single value.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful to unraveling its structure and getting it into a usable format.

3.0.1 The 5 most common problems with messy datasets are:

Observational units: Of the three rules, the idea of observational units might be the hardest to grasp. As an example, you may be tracking a puppy population across 4 variables: age, height, weight, fur colour. Each observation unit is a puppy. However, you might be tracking the same puppies across multiple measurements - so a time factor applies. In that case, the observation unit now becomes puppy-time. In that case, each puppy-time measurement belongs in a different table (at least by tidy data standards). This, however, is a simple example and things can get more complex when taking into consideration what defines an observational unit. Check out this blog post by Claus O. Wilke for a little more explanation.

Let's begin this journey with data import.


3.1.0 Opening and saving files with the readr package - "All roads lead to Rome.."

... but not all roads are easy to travel.

Depending on format, data files can be opened in a number of ways. The simplest methods we will use involve the readr package as part of the tidyverse. These functions have already been developed to simplify the import process for users. The functions we will use most often are:

Let's read in our first dataset so that we can convert from wide to long format.
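As a quick sketch, read_csv() can parse literal text as well as a file path, which is handy for demos; the course file path below is illustrative:

```r
library(readr)

# read_csv() treats a string containing newlines as literal CSV data
demo.df <- read_csv("Date,Algoma,Brant\n2020-03-24,0,2\n2020-03-25,1,NA")
demo.df   # a 2-row tibble with parsed column types

# Loading the actual course dataset would look something like (path assumed):
# covid_phu.df <- read_csv("Ontario_daily_change_in_cases_by_phu.csv")
```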


3.1.1 Our SARS-CoV-2 public health unit data covers 34 regions

From looking at our public health unit data, we can see that it begins tracking on 2020-03-24 and goes up until 2022-02-28. In total there are observations for 707 days across 34 public health units. The final column appears to be a running tally of total cases reported on each date.

From the outset, we can see there are some issues with the data set that we'll want to resolve and we'll work through some tidyverse functions in order to do that. First let's quickly review some of the potential problems with our dataset.

  1. There are 34 public health units and a total count for each date. It is preferable for data visualization to collapse all of those public health units into a single variable so that we have a single new_cases value for each Date observation. At the same time, we will not collapse Total into that same variable.
  2. The data is rife with NA values. Many instances are likely due to no data being collected on those dates. For our purposes, it may be simpler to replace them with a value of 0.
  3. Our public health unit names are clunky. We should trim them down to simpler region names.

In the end, we want to convert our data to look something like this:

| date \<date> | total_phu_new \<dbl> | public_health_unit \<fct> | new_cases \<dbl> |
|---|---|---|---|
| 2020-03-24 | 0 | Algoma | 0 |
| 2020-03-24 | 0 | Brant County | 0 |
| 2020-03-24 | 0 | Chatham-Kent | 0 |
| ... | ... | ... | ... |

Before we tackle these issues, let's go ahead and review some of the tools at our disposal.


3.2.0 The tidyverse package and its contents make manipulating data easier

While the tidyverse is composed of multiple packages, we will be focused on working with a subset of these: dplyr, tidyr, and stringr.

3.2.0.1 Redirect your output with %>% whenever you can!

To save on making extra variables in memory and to help make our code more concise, we should make use of the %>% symbol. This is a redirection or pipe symbol similar to the | in Unix operating systems and is used for redirecting output from one function to the input of another. By thoughtfully combining this with other commands, we can alter or query our datasets with ease.

We'll also introduce the %<>% in this class. This is a little more advanced but it allows us to assign the final product of our chain of commands to the very first object.

Whenever we are redirecting, we are implicitly passing our output to the first parameter of the next function. We may not always want to use the entirety of the output or we may want to also reuse that redirected output as part of another parameter. To do so we can use . to explicitly denote the redirected output.
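A brief sketch of both pipe behaviours (the numbers are arbitrary):

```r
library(dplyr)   # loads the %>% pipe

# Without the pipe: nested calls read inside-out
round(mean(c(1.2, 2.7, 3.9)), 1)             # 2.6

# With the pipe: output flows left to right into the first parameter
c(1.2, 2.7, 3.9) %>% mean() %>% round(1)     # 2.6

# Use . to place the piped value somewhere other than the first parameter
4 %>% seq(1, ., by = 1)                      # 1 2 3 4
```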

3.2.0.2 dplyr has functions for accessing and altering your data

We will use the "verbs" of the dplyr function often to massage the look of our data by changing column names or subsetting it. The most common verbs you will see in this course are.

| Function(s) | Description |
|---|---|
| arrange() | Arrange rows by column values |
| count(), tally() | Count observations by group |
| distinct() | Subset rows by distinct or unique values |
| filter() | Subset rows by column values |
| mutate(), transmute() | Create, modify, or delete columns |
| select() | Subset columns using their names and types |
| summarize() or summarise() | Summarize by groups to fewer rows |
| group_by() vs. ungroup() | Group by one or more variables (or remove grouping) |
| rename(), relocate() | Rename or move columns |

3.2.0.3 tidyr has additional functions for reshaping our data

The tidyr package will be most useful when we are trying to reshape our data from the wide to the long format or vice versa. This is much more useful for when we want to drastically alter portions or all of our data.

| Function(s) | Description |
|---|---|
| pivot_longer() | Pivot data from wide to long |
| pivot_wider() | Pivot data from long to wide |
| extract() | Extract a character column into multiple groups |
| separate() | Separate a character column into multiple groups |
| unite() | Unite multiple columns into one by pasting strings |
| drop_na() | Drop rows containing missing values |
| replace_na() | Replace NAs with specific values |

3.2.0.4 stringr provides functionality for searching data based on regular expressions

The stringr package will come in most useful when we are trying to fix string issues with our data. Many times our headers or data will contain spaces or poor formatting. Often we will prefer to have our headers in lower case, with any spaces replaced by an _. We'll also use verbs from this package to make any variables or data more concise.

| Category | Function(s) | Description |
|---|---|---|
| String analysis | str_count() | Count the number of matches in a string |
| String retrieval | str_detect() | Detect the presence (or absence) of a pattern in a string |
| | str_extract() and str_extract_all() | Extract matching patterns from a string |
| | str_match() and str_match_all() | Extract matched groups from a string |
| | str_subset() and str_which() | Keep or find strings matching a pattern |
| String alteration | str_remove() and str_remove_all() | Remove matched patterns from a string |
| | str_split(), str_split_fixed(), and str_split_n() | Split a string into pieces |
| | str_c() | Concatenate multiple strings into a single string with optional separator |
| | str_flatten() | Flatten a string vector into a single string |
| | str_sub() | Extract and replace substrings from a character vector |
| | str_to_upper() and str_to_lower() | Convert the case of a string |

GoT_Examples_coming.png
Time to tackle our dataset!

3.2.1 Reformat our wide table with pivot_longer()

As you may recall, our PHU data is formatted such that each column represents new cases per day for a single PHU. It's a great format for data entry and certainly reduces redundancy. However, for us to work with this data, we want to collapse all of those PHUs into a single column.

Today we will use the pivot_longer() function to convert our wide-format data over to long-format. For our purposes, we will rely on four parameters:

  1. data: the data frame (and columns) that we wish to transform.
  2. cols: the columns that we wish to gather/collapse into a long format.
  3. names_to: the variable name of the new column to hold the collapsed information from our current columns.
  4. values_to: The variable name of the values for each observation that we are collapsing down.

We'll be using a series of %>% so for now we won't save our work to a new object.
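As a sketch of the call (the tibble and PHU column names below are made up for illustration; our real data frame will differ):

```r
library(tidyverse)

# Hypothetical wide-format data: one column of dates, one column per PHU
covid_wide <- tibble(
  Date    = as.Date(c("2020-03-01", "2020-03-02")),
  Toronto = c(5, NA),
  Ottawa  = c(2, 3)
)

covid_wide %>%
  pivot_longer(
    cols      = -Date,                  # collapse every column except Date
    names_to  = "Public_Health_Unit",   # old column names become this variable
    values_to = "new_cases"             # cell values become this variable
  )
```

The result has one row per Date/PHU combination, which is exactly the long format we want.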


3.2.2 Replace NA values from our data with replace_na()

Our conversion to long format creates 14,038 observations relating a Date to a new_cases value in a specific Public_Health_Unit (or total). From the looks of our data, however, we have a number of NA values under our new_cases variable.

We have two options:

  1. Remove the NA observations from our data set. There won't be any loss of information since we could rebuild the original data if we really needed to.
  2. Replace the NA observations with a value that makes sense for our analysis.

Let's replace the missing observations with a new value, 0, using replace_na(). This function will need two parameters:

  1. data: the data frame or vector that it will scan for NA values.
  2. replace: the value that we will use to replace NA.

We're going to update our pipe of commands and save the final output into a new variable covid_phu_long.df.
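A minimal sketch of both ways to call replace_na() (the toy tibble here stands in for our real data):

```r
library(tidyverse)

df <- tibble(Date      = as.Date(c("2020-03-01", "2020-03-02")),
             new_cases = c(5, NA))

# On a data frame, `replace` is a named list of column = value pairs
df %>% replace_na(list(new_cases = 0))

# Equivalently, inside mutate() replace_na() works on a single vector
df %>% mutate(new_cases = replace_na(new_cases, 0))
```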


3.2.3 Reformat our public health unit names with str_replace_all()

Looking at our PHU names, we can see that there is a lot of redundancy in our names. We see they end in some form of:

We have a couple of choices: we can use str_replace_all() or a specialized version of it, str_remove_all(), which simply replaces a pattern with an empty string.

For str_replace_all() we will supply:

  1. string: a single string or vector of strings.
  2. pattern: the pattern we wish to search for in the form of a string or regular expression.
  3. replacement: the replacement string we wish to use.

We also see the odd "," here, but we'll leave those and instead perform a second replacement on the updated strings, replacing all of the underscores (_) with spaces. To wrap up, we'll convert our updated variable to a factor and overwrite our original covid_phu_long.df.

We will accomplish this all through multiple calls to mutate.
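Sketched out, the chain of mutate() calls might look like this (the suffix pattern "_Public_Health" and the PHU names are illustrative, not the actual values in our dataset):

```r
library(tidyverse)

phu <- tibble(Public_Health_Unit = c("Toronto_Public_Health",
                                     "Ottawa_Public_Health"))

phu %>%
  # Remove the redundant suffix (pattern here is hypothetical)
  mutate(Public_Health_Unit = str_remove_all(Public_Health_Unit, "_Public_Health")) %>%
  # Replace underscores with spaces
  mutate(Public_Health_Unit = str_replace_all(Public_Health_Unit, "_", " ")) %>%
  # Convert to a factor for plotting later
  mutate(Public_Health_Unit = factor(Public_Health_Unit))
```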


3.2.4 rename() variables for clarity

Now that we have the basic structure for our data, we want to clean it up just a little bit by renaming our Total column to clarify that it represents total new cases across all PHUs for that date. Why did we keep this column separate? Now we can use this information to generate percentage totals for each PHU if we choose to. We'll also change our Date column to lower case at the same time.

We'll use rename() from dplyr to accomplish the task of renaming our column. There are a number of ways you could accomplish this without using dplyr but the simplicity of it is nice.
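The rename() syntax is new_name = old_name; a minimal sketch with a toy tibble:

```r
library(dplyr)

df <- tibble::tibble(Date = as.Date("2020-03-01"), Total = 7)

df %>%
  rename(total_phu_new = Total,   # clarify what the column represents
         date          = Date)    # lower-case the Date column
```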


3.2.5 Reorder your columns with relocate()

The last cleanup we can accomplish with our data is to move total_phu_new to the last column of our data frame. This is a matter of personal preference, but it also makes the data easier to read at a glance. The relocate() verb from dplyr accomplishes this with ease since we are not dropping or removing columns. Its parameters are:

  1. .data: the data frame or tibble we want to alter
  2. ...: the columns we wish to move
  3. .before or .after: determines the destination of the columns. Supplying neither will move columns to the left-hand side.

In fact, relocate() can be used to rename a column as well but it will also be moved by default so consider the ramifications of such an action!

Note: We could accomplish a similar result using the select command as well. It's really up to what you're comfortable with but it is much simpler to use relocate() when you are working with a large number of columns and you want to move one to a specific location.
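A minimal sketch of moving a column to the end with relocate() (toy tibble for illustration):

```r
library(dplyr)

df <- tibble::tibble(date          = as.Date("2020-03-01"),
                     total_phu_new = 7,
                     new_cases     = 5)

# Move total_phu_new to the last position using the last_col() helper
df %>% relocate(total_phu_new, .after = last_col())
```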


Comprehension Question 3.2.5: In the above example we used the relocate() function to move the "total_phu_new" column to the end of our data frame. What other methods could we use to accomplish the same feat? Use the below code cell to help yourself out.

3.3.0 Save your data to a file - "Country roads... save to home!"

At this point we have completed the data wrangling we want to accomplish on this dataset. We've converted it to a long-format and renamed the PHU entries while removing any NA values that may cause issues. There are a number of ways we could save this data now either as a text file or in its current form as a data frame in a .RData format.

Let's try some of those methods now.
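Sketches of the common options (file names here are hypothetical):

```r
# As delimited text with readr
readr::write_csv(covid_phu_long.df, "covid_phu_long.csv")

# As an R binary image, re-loadable with load()
save(covid_phu_long.df, file = "covid_phu_long.RData")

# As a single serialized object, re-loadable with readRDS()
saveRDS(covid_phu_long.df, "covid_phu_long.rds")
```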


3.3.0.1 readxl and writexl packages for working with excel spreadsheets

Not all of your data may come in a comma- or tab-delimited format. In the case of Excel spreadsheets, there are packages available that facilitate parsing these more complex files. The readxl package is part of the tidyverse, but the writexl package is not. There are other means of writing to an Excel file format, but they depend on other programs (like Java or Excel) or their drivers.

From the readxl package

From the writexl package (not a part of the tidyverse) but independent of Java and Excel
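A quick sketch of reading and writing an Excel file (file names are hypothetical):

```r
library(readxl)
library(writexl)

# Read the first sheet of a workbook into a tibble
dat <- read_excel("covid_data.xlsx", sheet = 1)

# Write a data frame back out as a new .xlsx file
write_xlsx(dat, "covid_data_out.xlsx")
```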


4.0.0 Simple graphical analysis of data with ggplot2

We now have some data in a tidy format that we'd like to visualize. We can begin with some initial analyses using the ggplot2 package. It has all of the components we need to help us decide which data we want to focus on or keep. There are a number of ways to visualize our data, and here we will refresh our ggplot skills.

Basic ggplot notes:


4.1.0 Make a line graph of new cases based on each PHU across all dates

We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We'll begin with a simple line graph of all the public health units across all dates within the set.

In order to update or add layers to a ggplot object, we use the + symbol between commands. For instance, to define the source of the x-axis and y-axis data, we use the aes() command to update the aesthetics layer. Remember how we defined the public_health_unit variable as a factor? We'll take advantage of that here and tell ggplot to give each PHU its own colour.

After defining our aesthetics, we still need to tell ggplot how to actually graph the data. The ggplot package comes with an abundance of visualizations accessed through the geom_*() commands. Some examples include
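Putting those layers together for our line graph might look like this (assuming the date, new_cases, and public_health_unit columns from our wrangled data frame):

```r
library(ggplot2)

ggplot(covid_phu_long.df,
       aes(x = date, y = new_cases, colour = public_health_unit)) +
  geom_line()  # one coloured line per PHU across all dates
```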


4.2.0 Use the facet_wrap() command to break PHUs into separate graphs

There's a lot of data on that graph, and some of it is quite drowned out by the scale set by PHUs with many more cases. To break out each PHU individually, we can add the facet_wrap() command. We'll also update some of the parameters:

At the same time, we'll also get rid of the legend since each individual graph will be labeled by its PHU.
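A sketch of the faceted version (again assuming our wrangled column names):

```r
library(ggplot2)

ggplot(covid_phu_long.df,
       aes(x = date, y = new_cases, colour = public_health_unit)) +
  geom_line() +
  facet_wrap(~ public_health_unit, scales = "free_y") +  # one panel per PHU, independent y-axes
  theme(legend.position = "none")                        # legend is redundant with panel labels
```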


4.3.0 Use the ggsave() command to save your plots to a file

There are a number of ways you can use the ggsave() command to specify how you want to save your files.
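For instance (file names and dimensions here are illustrative):

```r
library(ggplot2)

# With no plot argument, ggsave() saves the last plot displayed;
# the output format is inferred from the file extension
ggsave("phu_new_cases.png", width = 10, height = 8, units = "in", dpi = 300)

# Or pass a specific plot object (my_plot.gg is a hypothetical name)
ggsave("phu_new_cases.pdf", plot = my_plot.gg)
```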


4.4.0 Barplots can be used to summarize your data across PHUs

Although we do have a running total for each date, what if we want to look at total cases across subsets of the PHUs? Using a barplot, we can stack cases by date and get a sense of daily case totals for whichever sets of PHUs we choose.

This time we will use geom_bar() to display our data and tell it to use the values from our new_cases variable to generate the totals. We do this by setting the stat = "identity" parameter.

At the same time, let's update our colours to use a colour-blind friendly palette scheme.
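A sketch of the stacked barplot; the viridis scale here stands in for whatever colour-blind friendly palette we settle on:

```r
library(ggplot2)

ggplot(covid_phu_long.df,
       aes(x = date, y = new_cases, fill = public_health_unit)) +
  geom_bar(stat = "identity") +  # use the new_cases values as-is rather than counting rows
  scale_fill_viridis_d()         # a colour-blind friendly discrete palette
```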


4.4.1 Alter your bin widths to monthly totals by transforming your x-axis

From above we get a sense of overall totals for some PHU distributions but it's still too much to look at. Let's transform our x-axis values so we can bin by months instead. To accomplish this we'll use the as.yearmon() function found in the zoo package we loaded at the beginning of the lecture.
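A sketch of the transformation, assuming the tidyverse and zoo are already loaded:

```r
library(tidyverse)
library(zoo)

covid_phu_long.df %>%
  mutate(month = as.yearmon(date)) %>%                    # e.g. "Mar 2020"
  group_by(month, public_health_unit) %>%
  summarise(monthly_cases = sum(new_cases), .groups = "drop") %>%
  ggplot(aes(x = factor(month), y = monthly_cases, fill = public_health_unit)) +
  geom_bar(stat = "identity")
```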


4.5.0 Filter your data for what you want to display

Now that we have taken an initial look at our data, we can see that even after converting our axis to a month-year format, it appears that some of the data isn't that relevant for us. Some of the PHUs are not generating many new cases per day so we can now consider slicing our data up to look at specific regions.

Let's look at the top 10 regions by total caseload across the dataset.
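One way to find those regions is to total the cases per PHU and take the ten largest:

```r
library(dplyr)

top10_phus.df <- covid_phu_long.df %>%
  group_by(public_health_unit) %>%
  summarise(total_cases = sum(new_cases)) %>%  # total caseload per PHU
  slice_max(total_cases, n = 10)               # keep the 10 largest totals
```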


4.5.1 Use the filter() command to make a subset of our data

Now that we have a list of PHUs ordered by descending total cases, we can use that to filter our covid_phu_long.df dataframe and graph only the more heavily infected PHUs. We can then pipe the filtered data over to make a ggplot() object. At the same time we'll do a few more things:

  1. Reorder our factors so that the bars and legend display the PHUs in ascending order by new cases.
  2. Alter the plot title to reflect the data we are using.
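The steps above can be sketched as one pipeline (the title text is illustrative):

```r
library(tidyverse)

# Recompute the top-10 PHU names inline so the sketch is self-contained
top_phus <- covid_phu_long.df %>%
  group_by(public_health_unit) %>%
  summarise(total_cases = sum(new_cases)) %>%
  slice_max(total_cases, n = 10) %>%
  pull(public_health_unit)

covid_phu_long.df %>%
  filter(public_health_unit %in% top_phus) %>%
  # Reorder factor levels by total new cases so bars and legend sort sensibly
  mutate(public_health_unit = fct_reorder(public_health_unit, new_cases, .fun = sum)) %>%
  ggplot(aes(x = date, y = new_cases, fill = public_health_unit)) +
  geom_bar(stat = "identity") +
  labs(title = "Daily new cases in the 10 most affected PHUs")
```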

4.6.0 Looking at the effect of lockdown on new cases

We can see from our first graph of daily case loads that there can be quite a bit of variability from day to day. Rather than look at the daily tally of new cases, perhaps we can take into account the overall number of new cases appearing in a 14-day sliding window. Given that symptoms can take between 5-14 days from the time of infection to manifest, a portion of daily positive cases can be the result of infections going back as far as 14 days. Taking a look at a 14-day window will also smooth out our data, as we see below:

top5_PHU_cases_14d-window.png

To accomplish the above visualization, we'll need to perform some transformations on our dataset.

  1. Ensure our data is grouped by public health unit
  2. Summarise our data in sliding windows of 14-day length

We'll want to track observations by:


4.6.1 Plot our windowed data as a line graph

Now that we've generated our windowed data, let's plot the top 5 PHUs by caseload. Let's also annotate some dates from our pandemic history:

Here's what we'll do:

  1. Plot the windowed data filtered by the top 5 PHUs
  2. Clean up the graph a little bit by "simplifying" the themes
  3. Annotate 4 dates from the pandemic timeline
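A sketch of one annotated line (the PHU names, the windowed column name, and the annotation date/label are all illustrative placeholders, not the actual values we'll use in class):

```r
library(tidyverse)

# Hypothetical top-5 PHU names for illustration
top5_phus <- c("Toronto", "Peel", "York", "Ottawa", "Durham")

windowed.df %>%
  filter(public_health_unit %in% top5_phus) %>%
  ggplot(aes(x = date, y = mean_new_14d, colour = public_health_unit)) +
  geom_line() +
  theme_minimal() +                                  # simplify the theme
  geom_vline(xintercept = as.Date("2020-03-17"),     # illustrative date
             linetype = "dashed") +
  annotate("text", x = as.Date("2020-03-17"), y = Inf,
           label = "Provincial emergency declared",  # illustrative label
           angle = 90, vjust = -0.5, hjust = 1)
```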

5.0.0 Class summary

That's our first class! If we've made it this far, we've reviewed

  1. Foundational concepts in R
  2. Helpful functions in generating tidy data for analysis
  3. Basics of visualization using the ggplot2 package

We took a "messy" dataset from the Ontario government and created a tidy data set that we were able to visualize. We took that further by transforming the data into a 14-day sliding window of mean new cases per day in each public health unit. This clarified our picture of cases and visually confirmed that spread of SARS-CoV-2 does appear to be mitigated through lockdown orders.

Next week? Getting deeper into ggplot2!


5.1.0 Weekly assignment

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

| Component | Weight | Criteria |
| --- | --- | --- |
| Code | 50% | Does it follow best practices? Does it make good use of available packages? Was data prepared properly? |
| Answers and Output | 50% | Is output based on the correct dataset? Are groupings appropriate? Are titles/axes/legends correct? Is interpretation of the graphs correct? |

Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.

You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 12pm on the date of our next class: March 10th, 2022.


5.2.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.3.0 References

  1. Coercion from "R in a Nutshell": https://www.oreilly.com/library/view/r-in-a/9781449358204/ch05s08.html
  2. Tibbles vs data frames from "RStudio blog": https://blog.rstudio.com/2016/03/24/tibble-1-0-0/
  3. Indexing elements from a vector or list: https://cran.r-project.org/doc/manuals/R-lang.html#Indexing
  4. Change the levels of a factor: http://www.cookbook-r.com/Manipulating_data/Changing_the_order_of_levels_of_a_factor/
  5. The apply family of functions: https://www.r-bloggers.com/2015/07/r-tutorial-on-the-apply-family-of-functions/
  6. Tidy data principles: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
  7. Using the lubridate package: https://r4ds.had.co.nz/dates-and-times.html

6.0.0 Appendix 1: Instructions for installing your own software

6.1.0 Jupyter Notebooks and the R kernel

For this introductory course we will be teaching and running code for R through Jupyter notebooks. In this section we will discuss

  1. Installation of Jupyter (through Anaconda)
  2. Updating the default R package
  3. Starting up your Jupyter notebooks

6.1.1 Installing R and Jupyter Notebooks (via Anaconda3)

As of 2021-01-18, the latest version of Anaconda3 runs with Python 3.8.

Download the OS-appropriate version from here https://www.anaconda.com/products/individual

6.1.2 Updating the base version of R

As of 2020-12-11, the latest version of r-base available for Anaconda is 4.0.3, but Anaconda comes pre-installed with R 3.6.1. To save time, we will update just our r-base through the command line using the Anaconda prompt. You'll need to find the menu shortcut to the prompt in order to run these commands. Before class, you should update all of your Anaconda packages; this will get you the latest version of Jupyter notebook. Open up the Anaconda prompt and type the following command:

conda update --all

It will ask permission to continue at some point. Say 'yes' to this. After this is completed, use the following command:

conda install -c conda-forge/label/main r-base=4.0.3=hddad469_3

Anaconda will try to install a number of R-related packages. Say 'yes' to this.

6.1.3 Loading the R-kernel for your Jupyter notebook

Lastly, we want to connect your R version to the Jupyter notebook itself. Type the following command:

conda install -c r r-irkernel

Jupyter should now have R integrated into it. No need to build an extra environment to run it.

6.1.3.1 A quick note about Anaconda environments

You may find that for some reason or another, you'd like to maintain a specific R-environment (or other) to work in. Environments in Anaconda work like isolated sandbox versions of Anaconda within Anaconda. When you generate an environment for the first time, it will draw all of its packages and information from the base version of Anaconda - kind of like making a copy. You can also create these in the Anaconda prompt. You can even create new environments based on specific versions or installations of other programs. For instance, we could have tried to make an environment for R 4.0.3 with the command

conda create -n my_R_env -c conda-forge/label/main r-base=4.0.3=hddad469_3

This would create a new environment with version 4.0.3 of R but the base version of Anaconda would retain version 3.6.1 of R. A small but helpful detail if you are unsure about newer versions of packages that you'd like to use.

Likewise, you can update and install packages in new environments without affecting or altering your base environment! Again it's helpful if you're upgrading or installing new packages and programs. If you're not sure how it will affect what you already have in place, you can just install them straight into an environment.

For more information: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#cloning-an-environment

6.1.3.2 Using the Anaconda navigator to make a Jupyter notebook

If you are inclined, the Anaconda Navigator can help you make an R environment separate from the base, but you won't be able to perform the same fancy tricks as in the prompt, like installing new packages directly to a new environment.

Note: You should consider doing this only if you have a good reason to isolate what you're doing in R from the Anaconda base packages. You will also need to have installed r-base 4.0.3 to make a new environment with it through the Anaconda navigator.

The Anaconda navigator is a graphical interface that shows all of your pre-installed packages and gives you access to installing other common programs like RStudio (we'll get to that in a moment).

You will now have an R environment where you can install specific R packages that won't make their way into your Anaconda base.

You will likely find a shortcut to this environment in your (Windows) menu under the Anaconda folder. It will look something like Jupyter Notebook (R-4-0-3)

6.1.3.3 Installing packages for your personal Jupyter Notebook

Normally I suggest avoiding installing packages through your Jupyter Notebook. Instead, if you want to update your R packages for running Jupyter, it's best to add them through either the Anaconda prompt or Anaconda navigator. Again, using the prompt gives you more options but can seem a little more complicated.

One of the most useful packages to install for R is r-essentials. Open up the Anaconda prompt and use the command: conda install -c r r-essentials. After running, the Anaconda prompt will inform you of any package dependencies and it will identify which packages will be updated, newly installed, or removed (unlikely).

Anaconda has multiple channels (similar to repositories) that are maintained by different groups. These channels port regular R packages over to a format that can be installed in Anaconda and run by R. The two main channels you'll find useful for this are the r channel and the conda-forge channel. You can find more information about all of the packages on docs.anaconda.com. As you might have guessed, the basic format for installing packages is conda install -c channel-name r-package where

- conda install is the call to install packages. This can be done in a base or custom environment.
- -c channel-name identifies the specific channel to install from.
- r-package is the name of your package; most of them will begin with r-, e.g. r-ggplot2.


6.2.0 R and RStudio

6.2.1 Installing R

As of 2020-06-25, the latest stable R version is 4.0.3:

Windows:

- Go to <http://cran.utstat.utoronto.ca/>      
- Click on 'Download R for Windows'     
- Click on 'install R for the first time'     
- Click on 'Download R 4.0.3 for Windows' (or a newer version)     
- Double-click on the .exe file once it has downloaded and follow the instructions.

(Mac) OS X:

- Go to <http://cran.utstat.utoronto.ca/>      
- Click on 'Download R for (Mac) OS X'     
- Click on R-4.0.3.pkg (or a newer version)     
- Open the .pkg file once it has downloaded and follow the instructions.


Linux:

- Open a terminal (Ctrl + alt + t)
- sudo apt-get update     
- sudo apt-get install r-base     
- sudo apt-get install r-base-dev (so you can compile packages from source)


6.2.2 Installing RStudio

As of 2021-01-18, the latest RStudio version is 1.4.1103

Windows:

- Go to <https://www.rstudio.com/products/rstudio/download/#download>     
- Click on 'RStudio 1.3.1093 - Windows Vista/7/8/10' to download the installer (or a newer version)     
- Double-click on the .exe file once it has downloaded and follow the instructions.

(Mac) OS X:

- Go to <https://www.rstudio.com/products/rstudio/download/#download>     
- Click on 'RStudio 1.3.1093 - Mac OS X 10.13+ (64-bit)' to download the installer (or a newer version)     
- Double-click on the .dmg file once it has downloaded and follow the instructions.     


Linux:

- Go to <https://www.rstudio.com/products/rstudio/download/#download>     
- Click on the installer that describes your Linux distribution, e.g. 'RStudio 1.3.1093 - Ubuntu 18/Debian 10(64-bit)' (or a newer version)     
- Double-click on the .deb file once it has downloaded and follow the instructions.     
- If double-clicking on your .deb file did not open the software manager, open the terminal (Ctrl + alt + t) and type **sudo dpkg -i /path/to/installer/rstudio-xenial-1.3.959-amd64.deb**

 _Note: You have 3 things that could change in this last command._     
 1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)     
 2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).      
 3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).

If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.


6.2.3 Getting to know the RStudio environment

RStudio is an IDE (Integrated Development Environment) for R that provides a more user-friendly experience than using R in a terminal setting. It has 4 main areas or panes, which you can customize to some extent under Tools > Global Options > Pane Layout:

  1. Source - The code you are annotating and keeping in your script.
  2. Console - Where your code is executed.
  3. Environment - What global objects you have created and functions you have written/sourced.
    History - A record of all the code you have executed in the console.
    Connections - Which data sources you are connecting to. (Not being used in this course.)
  4. Files, Plots, Packages, Help, Viewer - self-explanatoryish if you click on their tabs.

All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.

R_studio_default_layout.jpg

6.2.3.1 Source

The Source is where you keep the code and annotation that you want saved as your script. The tab at the top left of the pane has your script name (i.e. 'Untitled.R'), and you can switch between scripts by toggling the tabs. You can save, search, or publish your source code using the buttons along the pane header. Code in the Source pane is not run or executed automatically.

To run your current line of code or a highlighted segment of code from the Source pane you can:
a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter (not recommended).

There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.

6.2.3.2 Console

You can also type and execute your code (by hitting ENTER) in the Console when the > prompt is visible. If you enter code and you see a + instead of a prompt, R doesn't think you are finished entering code (i.e. you might be missing a bracket). If this isn't immediately fixable, you can hit Esc twice to get back to your prompt. Using the up and down arrow keys, you can find previous commands in the Console if you want to rerun code or fix an error resulting from a typo.

On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information about R on startup (such as the version number), during the installation of packages, when there are warnings, and when there are errors.

6.2.3.3 Environment

In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.

Objects are made by using the assignment operator <-. On the left side of the arrow, you have the name of your object. On the right side you have what you are assigning to that object. In this sense, you can think of an object as a container. The container holds the values given as well as information about 'class' and 'methods' (which we will come back to).

Type x <- c(2,4) in the Console followed by Enter. 1D objects' data types can be seen immediately as well as their first few values. Now type y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c")) in the Console followed by Enter. You can immediately see the dimension of 2D objects, and you can check the structure of data frames and lists (more later) by clicking on the object's arrow. Clicking on the object name will open the object to view in a new tab. Custom functions created in session or sourced will also appear in this pane.

The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (i.e. base, grDevices).

In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.

The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.

6.2.3.4 Files, Plots, Packages, Help, Viewer

The Files tab allows you to search through directories; you can go to or set your working directory by making the appropriate selection under the More (blue gear) drop-down menu. The ... to the top left of the pane allows you to search for a folder in a more traditional manner.

The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.

The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to the package means that the package is loaded. You can load a package by adding a check mark next to a package, however it is good practice to instead load the package in your script to aid in reproducibility.

The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case googling the function is a good idea.

The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.

6.2.3.5 Global Options

I suggest you take a look at Tools -> Global Options to customize your experience.

For example, under Code -> Editing I have selected Soft-wrap R source files followed by Apply so that my text will wrap by itself when I am typing and not create a long line of text.

You may also want to change the Appearance of your code. I like the RStudio theme: Modern and Editor font: Ubuntu Mono, but pick whatever you like! Again, you need to hit Apply to make changes.

That whirlwind tour isn't everything the IDE can do, but it is enough to get started.

CAGEF_services_slide.png